{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Linear regression\n", "\n", "A simple machine learning model that can uncover relationships in data.\n", "\n", "Linear regression is a well-established machine learning algorithm that is commonly used for modelling and analyzing data.\n", "\n", "It is a simple and effective technique for quantifying relationships between variables and predicting outcomes. The basic premise of linear regression is to find the line (or, with several features, the hyperplane) that best fits the data, typically by minimizing the squared error between the predicted and observed values of the dependent variable. The fitted coefficients summarize how the dependent variable changes with each independent variable, which helps identify trends in the data and make predictions on new observations.\n", "\n", "Linear regression is a versatile tool with applications in various fields, from finance and economics to healthcare and engineering.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How To" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
<table>\n",
" <thead>\n",
" <tr><th></th><th>longitude</th><th>latitude</th><th>housing_median_age</th><th>total_rooms</th><th>total_bedrooms</th><th>population</th><th>households</th><th>median_income</th><th>median_house_value</th><th>ocean_proximity</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>-122.23</td><td>37.88</td><td>41.0</td><td>880.0</td><td>129.0</td><td>322.0</td><td>126.0</td><td>8.3252</td><td>452600.0</td><td>NEAR BAY</td></tr>\n",
" <tr><th>1</th><td>-122.22</td><td>37.86</td><td>21.0</td><td>7099.0</td><td>1106.0</td><td>2401.0</td><td>1138.0</td><td>8.3014</td><td>358500.0</td><td>NEAR BAY</td></tr>\n",
" <tr><th>2</th><td>-122.24</td><td>37.85</td><td>52.0</td><td>1467.0</td><td>190.0</td><td>496.0</td><td>177.0</td><td>7.2574</td><td>352100.0</td><td>NEAR BAY</td></tr>\n",
" <tr><th>3</th><td>-122.25</td><td>37.85</td><td>52.0</td><td>1274.0</td><td>235.0</td><td>558.0</td><td>219.0</td><td>5.6431</td><td>341300.0</td><td>NEAR BAY</td></tr>\n",
" <tr><th>4</th><td>-122.25</td><td>37.85</td><td>52.0</td><td>1627.0</td><td>280.0</td><td>565.0</td><td>259.0</td><td>3.8462</td><td>342200.0</td><td>NEAR BAY</td></tr>\n",
" </tbody>\n",
"</table>
" ], "text/plain": [ " longitude latitude housing_median_age total_rooms total_bedrooms \\\n", "0 -122.23 37.88 41.0 880.0 129.0 \n", "1 -122.22 37.86 21.0 7099.0 1106.0 \n", "2 -122.24 37.85 52.0 1467.0 190.0 \n", "3 -122.25 37.85 52.0 1274.0 235.0 \n", "4 -122.25 37.85 52.0 1627.0 280.0 \n", "\n", " population households median_income median_house_value ocean_proximity \n", "0 322.0 126.0 8.3252 452600.0 NEAR BAY \n", "1 2401.0 1138.0 8.3014 358500.0 NEAR BAY \n", "2 496.0 177.0 7.2574 352100.0 NEAR BAY \n", "3 558.0 219.0 5.6431 341300.0 NEAR BAY \n", "4 565.0 259.0 3.8462 342200.0 NEAR BAY " ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "df = pd.read_csv(\"data/housing.csv\")\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preparing training data" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression\n", "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "x_train, x_test, y_train, y_test = train_test_split(df[[\"housing_median_age\", \"total_rooms\", \"median_income\"]], \n", " df.median_house_value, test_size=.5,\n", " stratify=df.ocean_proximity)" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(20640, 10)" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(10320, 3)" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x_train.shape" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(10320, 3)" ] }, "execution_count": 44, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ "x_test.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Building the model" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "model = LinearRegression()" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LinearRegression()" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.fit(x_train, y_train)" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.504466886613274" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.score(x_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Improving the model" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "from sklearn import preprocessing" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "x_val, x_test, y_val, y_test = train_test_split(x_test, y_test)" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2580, 3)" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x_test.shape" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "scaler = preprocessing.StandardScaler()\n", "model = LinearRegression()" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "StandardScaler()" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "scaler.fit(x_train)" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[-1.48316536, -0.99365153, -0.87440976],\n", " [ 0.43085565, -0.39327003, 0.80370426],\n", " [ 1.86637141, 
-0.63119073, -0.92829028],\n", " ...,\n", " [ 1.86637141, -0.92223068, -0.98650237],\n", " [-0.0476496 , -0.29015619, -0.58562075],\n", " [ 0.5903574 , -0.44058635, -1.13308907]])" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Standardize the training features with the statistics learned by scaler.fit\n", "x_scaled = scaler.transform(x_train)\n", "x_scaled" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LinearRegression()" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.fit(x_scaled, y_train)" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5118435778695601" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Score on the validation split, scaled with the scaler fitted on the training data\n", "model.score(scaler.transform(x_val), y_val)" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.51184357786956" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Repeat the fit with min-max scaling in place of standardization\n", "scaler = preprocessing.MinMaxScaler().fit(x_train)\n", "model = LinearRegression().fit(scaler.transform(x_train), y_train)\n", "model.score(scaler.transform(x_val), y_val)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Predicting with the model" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([307579.1482936 , 195257.7614732 , 210653.12200599, ...,\n", " 134249.034224 , 147248.80763583, 347003.6324425 ])" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Predict house values for the held-out test split\n", "model.predict(scaler.transform(x_test))" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "8629 470000.0\n", "6090 173300.0\n", "18972 191300.0\n", "5979 240700.0\n", "8751 366900.0\n", " ... 
\n", "15627 500001.0\n", "2761 81300.0\n", "14886 146300.0\n", "15177 210100.0\n", "17047 500001.0\n", "Name: median_house_value, Length: 2580, dtype: float64" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# True house values for the same test rows, to compare with the predictions\n", "y_test" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Inspecting the model" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([102231.51493819, 124088.28743787, 619644.91567053])" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# One coefficient per scaled feature: housing_median_age, total_rooms, median_income\n", "model.coef_" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-3233.8078487273597" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Predicted value when every scaled feature is zero\n", "model.intercept_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Experiment with how different preprocessing steps affect your data and the model's score." 
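] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One possible starting point for the exercise, offered as a sketch rather than a reference solution. It uses synthetic stand-in data (an assumption, so the cell runs without `data/housing.csv`) and fits the same `LinearRegression` behind no scaler, `StandardScaler`, and `MinMaxScaler`. For plain least squares with an intercept, rescaling each feature does not change the predictions, so the three validation scores should agree up to floating-point error, which is consistent with the two nearly identical scores in the section on improving the model. Swapping in a regularized model such as `Ridge`, or a nonlinear transform of the features, is where preprocessing would be expected to start changing the score." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"import numpy as np\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.pipeline import make_pipeline\n",
"from sklearn.preprocessing import MinMaxScaler, StandardScaler\n",
"\n",
"# Synthetic stand-in data (an assumption, not the housing set):\n",
"# three features on very different scales plus some noise.\n",
"rng = np.random.default_rng(0)\n",
"X = rng.normal(size=(500, 3)) * np.array([1.0, 100.0, 10000.0])\n",
"y = X @ np.array([3.0, 0.05, 0.0002]) + rng.normal(size=500)\n",
"X_fit, X_val, y_fit, y_val = train_test_split(X, y, random_state=0)\n",
"\n",
"# Fit the same linear model behind each preprocessing choice\n",
"# and record the validation R^2 for comparison.\n",
"scores = {}\n",
"for name, scaler in [('none', None), ('standard', StandardScaler()), ('minmax', MinMaxScaler())]:\n",
"    steps = [LinearRegression()] if scaler is None else [scaler, LinearRegression()]\n",
"    scores[name] = make_pipeline(*steps).fit(X_fit, y_fit).score(X_val, y_val)\n",
"\n",
"print(scores)"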
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Additional Resources" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- [Model Selection](https://scikit-learn.org/stable/model_selection.html)\n", "- [Scikit-Learn Train Test Split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)\n", "- [Scikit-Learn Linear Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 4 }